Abstract

We present a novel technique to remove spurious ambiguity from transition systems for dependency parsing. Our
technique chooses a canonical sequence of transition operations (computation) for a given dependency tree. Our
technique can be applied to a large class of bottom-up transition systems, including for instance [Nivre, 2004] and
[Attardi, 2006].
1 Introduction

In parsing, spurious ambiguity refers to ambiguity in a grammar that occurs because several derivations exist for an 
identical syntactic analysis. When the grammar is enriched with probabilities, the existence of spurious ambiguity
implies that the statistical model is defined over derivations, a more fine-grained version of the actual syntactic structures
of interest. The probability of a syntactic structure then becomes the marginalized probability over all derivations that 
map to that syntactic structure. 
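
The marginalization described above can be written out as follows. This is only a notational sketch: the symbols $y$ (a syntactic structure), $d$ (a derivation) and $D(d)$ (the structure that $d$ maps to) are assumed here for illustration.

```latex
\[
  P(y) \;=\; \sum_{d \,:\, D(d) = y} P(d)
\]
```

When every structure has exactly one derivation, the sum has a single term and the probability of a structure equals the probability of its derivation.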



Spurious ambiguity can exist in various grammatical models such as combinatory categorial grammars [Steedman,
2001], tree adjoining grammars [Joshi et al., 1975], data-oriented parsing [Bod, 1992] and transition-based dependency
parsing [Nivre, 2005].

While models with spurious ambiguity are statistically more expressive than models without spurious ambiguity,¹
an obstacle exists in the need to marginalize out derivations in order to compute the total probability of a syntactic
structure, which is necessary for training and decoding with such models. For many models with spurious ambiguity,
it is in fact provably NP-hard to do such marginalization [Sima'an, 1996].

Various heuristics exist to sidestep the need for marginalization. For example, during decoding, one can find the 
highest-scoring derivation instead of the highest-scoring structure. Under the assumption that most of the probability 
mass of a given syntactic structure is concentrated on a single derivation, this alternative decoding can be successful. 
However, this assumption often fails when the probability mass is evenly divided for one syntactic structure but 
concentrated on a single derivation for another. Even when marginalization can be done efficiently, the likelihood 
of observed data often becomes non-convex, which is undesirable for training the model because of the local optima 
problem. For these reasons, it is preferable in most cases to eliminate spurious ambiguity. 
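
The failure mode of the highest-scoring-derivation heuristic can be illustrated with a toy example. The distribution below is entirely hypothetical (the derivation names, tree names and probabilities are made up for illustration): one tree spreads its mass evenly over three derivations, the other concentrates it on one.

```python
# Hypothetical toy distribution over derivations: name -> (tree, probability).
# All names and numbers are assumptions made up for this illustration.
derivations = {
    "d1": ("tree_A", 0.22),
    "d2": ("tree_A", 0.20),
    "d3": ("tree_A", 0.18),
    "d4": ("tree_B", 0.40),
}

def best_derivation_tree(derivs):
    """Tree of the single highest-scoring derivation."""
    best = max(derivs, key=lambda name: derivs[name][1])
    return derivs[best][0]

def best_marginal_tree(derivs):
    """Tree with the highest probability after summing over its derivations."""
    totals = {}
    for tree, prob in derivs.values():
        totals[tree] = totals.get(tree, 0.0) + prob
    return max(totals, key=totals.get), totals

viterbi_tree = best_derivation_tree(derivations)
marginal_tree, totals = best_marginal_tree(derivations)
print(viterbi_tree, marginal_tree)
```

Here the best derivation belongs to one tree while the marginal probability favors the other, so the two decoding criteria disagree.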

¹ By this we mean that there are distributions over syntactic structures which can be obtained using models with spurious ambiguity but cannot
be obtained using models without spurious ambiguity.

In this paper, we focus on eliminating spurious ambiguity that exists in transition-based dependency parsing.
Ambiguity arises because several sequences of shift and reduce operations (which assemble a derivation) could yield
identical dependency trees. The transition-based parsing literature has implicitly tackled the issue of spurious
ambiguity by defining an oracle which, after receiving a dependency tree as input, outputs a unique derivation for that
tree based on a canonical ordering of the transition operations. This oracle is then used on the training data (pairs of
sentences and dependency trees), yielding new training data (pairs of sentences and shift-reduce derivations) to train
multi-class classifiers that decide at each transition step which operation to take [Nivre et al., 2004].



Rather than eliminating spurious ambiguity from the model, this heuristic creates a bias through training to prefer
certain derivations for a given dependency tree when doing decoding. In addition, as we discuss in §5, some of the
existing oracles for supervised dependency parsing are based on incomplete heuristics (which are often undocumented).

We present a more principled approach to eliminate spurious ambiguity in transition-based dependency parsing.
We first define a wide class of bottom-up transition systems, which includes the arc-standard transition system [Nivre,
2008] as well as the transition system from Attardi [2006]. One could also define a transition-based parser using a
strategy which is a hybrid between the arc-standard strategy and the easy-first strategy of Goldberg and Elhadad,
in which a set of shift actions would need to be taken before a reduction decision is made affecting elements
at some deeper position on the stack: this decision can depend on the "easiness" of the reduction. Such a parser can
be easily encapsulated into our framework.

We then provide a general technique to enrich the transitions of these systems in order to remove spurious
ambiguity while maintaining the completeness of the enriched system with respect to the original. Each tree is associated
with a single derivation, which is a sequence of shift and reduce operations such that reduce operations are performed
as soon as possible, and conflicts between several reductions are resolved by first attaching dependents that are closer
to the current focus point of the parser (top of the stack). This is coherent with psycholinguistic models postulating
that humans tend to process local attachments first [Gibson, 2000].

Our approach eliminates ambiguity from a declarative transition system. However, it is extensible to a decoding
algorithm as well. The transition systems we introduce can be made probabilistic in a manner similar to the one
that appears in [Cohen et al., 2011]. Then, a dynamic programming algorithm for these probabilistic systems can
be derived so that one can identify the highest scoring derivation and compute the expectations of features in the
model [Kuhlmann et al., 2011; Cohen et al., 2011]. Our removal of spurious ambiguity is efficient: the dynamic
programming algorithm which is based on the transformed transition system has the same asymptotic complexity as a
dynamic programming algorithm for the original transition system.

Our original motivation was to construct a probabilistic model for transition-based dependency parsing, such that 
a unique (canonical) derivation exists for each dependency tree. This avoids the computational complexity involved in 
marginalizing derivations. Removal of spurious ambiguity in such a case has to be done at the level of the transition 
system and not at the level of a tabular method simulating the system or at the level of the resulting parse forest: 
removing undesired derivations from the chart does not tell us how to set transition probabilities in the original system 
in such a way that the probability mass of each dependency tree is allocated to a single canonical derivation. 

The rest of this paper is organized as follows. We provide an overview of transition-based dependency parsing
in §2. We then describe the main details of the spurious ambiguity removal technique in §3. We provide proofs and
formal analysis in §4. We apply our technique to the parser from [Attardi, 2006] and run some experiments in §5. We
describe other applications of our technique in §6, and we conclude with an open problem in §7.



2 Transition-Based Dependency Parsing 

In this section we briefly introduce the basic definitions for transition-based dependency parsing; we refer the reader to
[Nivre, 2008] for a more detailed presentation. We also define the class of transition-based parsers which is investigated
in this paper.

2.1 General Transition Systems 

Let $\Sigma$ be an input alphabet and let $w = a_1 \cdots a_n$, $n \ge 1$, be the input string with $a_i \in \Sigma$ for each $i$ with $1 \le i \le n$. A
dependency tree for $w$ is a directed tree $G = (V_w, A)$ where $V_w = \{0, 1, \ldots, n\}$ is the set of nodes and $A \subseteq V_w \times V_w$
is a set of arcs. Each node encodes the position of a token in $w$, with 0 being a dummy node used as an artificial root,
and each arc encodes a dependency relation between two tokens. We write $i \to j$ to denote a directed arc $(i, j) \in A$,
where node $i$ is the head and node $j$ is the dependent.

A transition system for dependency parsing is a tuple $S = (C, T, I, C_t)$, where $C$ is a set of configurations,
defined below, $T$ is a finite set of transitions, which are partial functions $t : C \to C$, $I$ is a total initialization function
mapping each input string to a unique initial configuration, and $C_t \subseteq C$ is a set of terminal configurations.

A configuration is defined relative to input string $w$, and is a triple $(\sigma, \beta, A)$. Symbols $\sigma$ and $\beta$ are disjoint lists of
nodes from $V_w$, called stack and input buffer, respectively, and $A \subseteq V_w \times V_w$ is a set of arcs. If $t$ is a transition and
$c_1, c_2$ are configurations such that $t(c_1) = c_2$, we write $c_1 \vdash_t c_2$, or simply $c_1 \vdash c_2$ if $t$ is understood from the context.

We denote the stack with its topmost element to the right and the buffer with its first element to the left. We indicate
concatenation in the stack and buffer by a vertical bar. For example, for $i \in V_w$, $\sigma|i$ denotes some stack with topmost
element $i$ and $i|\beta$ denotes some buffer with first element $i$. For $1 \le i \le n$, $\beta_i$ denotes the buffer $[i, i+1, \ldots, n]$; for
$i > n$, $\beta_i$ denotes the empty buffer $[]$.

A computation of $S$ is a sequence $\gamma = c_0, \ldots, c_m$, $m \ge 1$, of configurations such that, for every $i$ with $1 \le i \le m$,
$c_{i-1} \vdash_{t_i} c_i$ for some $t_i \in T$. In other words, each configuration in a computation is obtained as the value of the
preceding configuration under some transition. A computation can be uniquely specified by its initial configuration $c_0$
and the sequence $t_1, \ldots, t_m$ of its transitions. Thus we will later denote $\gamma$ in the form $(c_0; t_1, \ldots, t_m)$.



2.2 Spurious Ambiguity 



A computation $\gamma = c_0, \ldots, c_m$ is called complete whenever $c_0 = I(w)$ for some input string $w$, and $c_m \in C_t$. For a
complete computation $\gamma$ we denote as $D(\gamma)$ the unique dependency tree consisting of nodes $V_w$ and all arcs in the final
configuration $c_m$. We say that a transition system has spurious ambiguity if, for some pair of complete computations
$\gamma$ and $\gamma'$ with $\gamma \ne \gamma'$, we have $D(\gamma) = D(\gamma')$.

Informally, the existence of spurious ambiguity implies that there are at least two computations that derive the
same dependency tree. Spurious ambiguity exists in various transition systems, such as those in [Nivre, 2004] and
[Attardi, 2006].



Example 1. The well-known arc-standard transition system by Nivre [2004] can be defined as follows: its
initialization function is $I(a_1 \cdots a_n) = ([0], [1 \cdots n], \emptyset)$, its set of terminal configurations is $C_t = ([0], [], A)$, and it has the
following transitions:

shift : $(\sigma, i|\beta, A) \vdash (\sigma|i, \beta, A)$
la : $(\sigma|i|j, \beta, A) \vdash (\sigma|j, \beta, A \cup \{j \to i\})$
ra : $(\sigma|i|j, \beta, A) \vdash (\sigma|i, \beta, A \cup \{i \to j\})$

The two following complete computations for a string $w = a_1 a_2 a_3$ produce the same tree with arcs $\{0 \to 2, 2 \to 1, 2 \to 3\}$:

(i) $(I(w);$ shift, shift, la, shift, ra, ra$)$;

(ii) $(I(w);$ shift, shift, shift, ra, la, ra$)$.

Therefore, this transition system has spurious ambiguity, caused by the fact that it allows words (in the example, $a_2$)
to choose whether to collect a left or a right dependent first.
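
The two computations of Example 1 can be replayed mechanically. The following is a minimal sketch of the arc-standard transitions as defined above; the function name `run_arc_standard` and the string labels for transitions are assumptions made for this illustration.

```python
def run_arc_standard(n, transitions):
    """Replay arc-standard transitions on a string of length n; return the arc set.

    Arcs are (head, dependent) pairs. The stack top is the end of the list.
    """
    stack, buffer, arcs = [0], list(range(1, n + 1)), set()
    for t in transitions:
        if t == "shift":            # (sigma, i|beta, A) |- (sigma|i, beta, A)
            stack.append(buffer.pop(0))
        elif t == "la":             # (sigma|i|j, beta, A) |- (sigma|j, beta, A + {j -> i})
            j = stack.pop()
            i = stack.pop()
            arcs.add((j, i))
            stack.append(j)
        elif t == "ra":             # (sigma|i|j, beta, A) |- (sigma|i, beta, A + {i -> j})
            j = stack.pop()
            arcs.add((stack[-1], j))
        else:
            raise ValueError(t)
    if stack != [0] or buffer:
        raise ValueError("computation is not complete")
    return arcs

tree_i = run_arc_standard(3, ["shift", "shift", "la", "shift", "ra", "ra"])
tree_ii = run_arc_standard(3, ["shift", "shift", "shift", "ra", "la", "ra"])
print(tree_i == tree_ii)
```

Both transition sequences end in a terminal configuration and yield the same arc set, which is exactly the spurious ambiguity described in the example.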

We remark that while in the case of the arc-standard model spurious ambiguity is restricted to a certain set of 
permutations over sequences of operations, i.e., all derivations of a given syntactic tree consist of the same transitions 
in some permutation, this does not hold in the case of non-projective models. 



2.3 Bottom-Up Shift-Reduce Transition Systems

Many of the transition systems for dependency parsing that have been proposed in the literature adopt a bottom-up
strategy, meaning that they construct dependency trees starting from the leaves and finishing with the root, by always
collecting all the dependents of a given node before assigning it as a dependent of another node. This includes for
instance the already mentioned arc-standard parser, and the non-projective parser of [Attardi, 2006]. These parsers
tend to present spurious ambiguity because, as in Example 1, the left and right dependents of a given node can be
collected in different orders. This is in contrast with parsers derived from the arc-eager model [Nivre, 2003], which are
not bottom-up and instead impose a unique left-to-right order in which arcs must be constructed.

Some bottom-up transition systems use reduce transitions that affect the buffer, but they can be cast in an alternative
form in which all reductions involve only elements from the stack. This is done by considering the first element of
the buffer as the topmost stack symbol, as discussed by Cohen et al. [2011]; in this way reductions might take place
between stack elements placed at positions deeper than the topmost one. The following definition captures the general
form of such models.



Definition. A transition system is bottom-up shift-reduce if its initialization function is $I(a_1 \cdots a_n) = ([0], [1 \cdots n], \emptyset)$,
its set of terminal configurations is $C_t = ([0], [], A)$, and its set of transitions consists of the following:

(i) a shift transition sh of the form $(\sigma, i|\beta, A) \vdash (\sigma|i, \beta, A)$;

(ii) a set of left arc transitions $\mathrm{la}_{p \leftarrow q}$ with $p > q \ge 1$, each of the form

$(\sigma|i_p|i_{p-1}|\cdots|i_1, \beta, A) \vdash (\sigma|i_{p-1}|\cdots|i_1, \beta, A \cup \{i_q \to i_p\})$;

(iii) a set of right arc transitions $\mathrm{ra}_{p \to q}$ with $p > q \ge 1$, each of the form

$(\sigma|i_p|\cdots|i_1, \beta, A) \vdash (\sigma|i_p|\cdots|i_{q+1}|i_{q-1}|\cdots|i_1, \beta, A \cup \{i_p \to i_q\})$.



Transitions in (ii) and (iii) above are called reductions. The degree of reductions $\mathrm{la}_{p \leftarrow q}$ and $\mathrm{ra}_{p \to q}$ is defined as $p - q$
and is always positive. The depth of reductions $\mathrm{la}_{p \leftarrow q}$ and $\mathrm{ra}_{p \to q}$ corresponds to the index $p$. The degree of a transition
system $S$, written $\deg(S)$, is the maximum degree among all its reductions. Analogously, the depth of a transition
system $S$, written $\mathrm{depth}(S)$, is the maximum depth among all its reductions.

The next definition introduces a condition that allows us to remove spurious ambiguity from bottom-up shift-reduce
parsers. Informally, the condition requires that the existence in the system of a reduction of some type involving stack
positions $p$ and $q$, $p > q$, always implies the existence in the system of reductions of the same type involving stack
positions $p'$ and $q'$ with $p' < p$ and $q' \le q$. We need some additional notation. Let $\mu(\mathrm{la}_{p \leftarrow q})$ be a set of transitions
including $\mathrm{la}_{p-1 \leftarrow q}$ if $p > q + 1$, $\mathrm{la}_{p-1 \leftarrow q-1}$ if $q > 1$, and no other transition. Similarly, $\mu(\mathrm{ra}_{p \to q})$ includes $\mathrm{ra}_{p-1 \to q}$
if $p > q + 1$, $\mathrm{ra}_{p-1 \to q-1}$ if $q > 1$, and no other transition.

Definition. Let $S$ be a bottom-up shift-reduce transition system with set of transitions $T$. $S$ is monotonic if for each
$t \in T$ we have $\mu(t) \subseteq T$.



Example 2. The transition-based parser of Attardi [2006] can be written as the bottom-up shift-reduce system with
transitions sh, $\mathrm{la}_{p \leftarrow 1}$ and $\mathrm{ra}_{p \to 1}$ for every $p$ with $2 \le p \le d$, $d = \mathrm{depth}(S)$. The system with depth 3, as used by
Kuhlmann and Nivre [2010] and Cohen et al. [2011], has transitions sh, $\mathrm{la}_{2 \leftarrow 1}$, $\mathrm{ra}_{2 \to 1}$, $\mathrm{la}_{3 \leftarrow 1}$ and $\mathrm{ra}_{3 \to 1}$.

These systems are monotonic for every value of $d$, since for a transition $\mathrm{la}_{p \leftarrow 1}$, we have that $\mu(\mathrm{la}_{p \leftarrow 1}) = \{\mathrm{la}_{p-1 \leftarrow 1}\}$
(if $p > 2$) or $\emptyset$ (otherwise), and therefore $\mu(\mathrm{la}_{p \leftarrow 1})$ is included in $T$. The same also holds for $\mu(\mathrm{ra}_{p \to 1})$.
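
The monotonicity condition is easy to check mechanically. The sketch below encodes a reduction as a hypothetical triple `(kind, p, q)` with `kind` in `{"la", "ra"}`; the encoding and the helper names `mu`, `is_monotonic` and `attardi` are assumptions for illustration, and the shift transition is left implicit.

```python
def mu(t):
    """The set mu(t) for a reduction t = (kind, p, q), following the definition above."""
    kind, p, q = t
    out = set()
    if p > q + 1:
        out.add((kind, p - 1, q))
    if q > 1:
        out.add((kind, p - 1, q - 1))
    return out

def is_monotonic(reductions):
    """A system is monotonic if mu(t) is included in T for each reduction t."""
    return all(mu(t) <= reductions for t in reductions)

def attardi(d):
    """Reductions la_{p<-1} and ra_{p->1} for every p with 2 <= p <= d."""
    return {(kind, p, 1) for kind in ("la", "ra") for p in range(2, d + 1)}

arc_standard = {("la", 2, 1), ("ra", 2, 1)}
print(is_monotonic(arc_standard), is_monotonic(attardi(3)))
print(is_monotonic({("la", 3, 1)}))  # la_{2<-1} is missing
```

A system containing only a depth-3 left reduction fails the check because the corresponding depth-2 reduction is absent.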



The monotonicity property is crucial for the main result of this paper: if a bottom-up shift-reduce transition system 
is monotonic, we can systematically obtain an equivalent system without spurious ambiguity, as described in the next 
section. 



3 Removal of Spurious Ambiguity 

Let S be a bottom-up shift-reduce transition system that is monotonic. We show how we can systematically obtain a 
new transition system S' without spurious ambiguity that is equivalent to S, that is, S' parses the same set of trees as 
S. In essence, this is the main result of this paper, which can be formally stated as follows: 
Theorem 3. Any transition system $S$ which is bottom-up shift-reduce and monotonic can always be converted into
an equivalent transition system $S'$ that does not have spurious ambiguity, such that:

(i) for each complete computation $\gamma'$ of $S'$ on $w$ there is a complete computation $\gamma$ of $S$ such that $D(\gamma) = D(\gamma')$;
and

(ii) for each complete computation $\gamma$ of $S$ on $w$ there is a complete computation $\gamma'$ of $S'$ such that $D(\gamma) = D(\gamma')$.

Next, we describe how $S'$ is created, and give full formal proofs of this theorem in §4.

3.1 Stack Symbols 

Recall that in $S$ each stack symbol is an integer $i$ representing the word occurrence in the input string. Each stack
symbol in $S'$ is obtained by annotating $i$ with the following Boolean features:

• a feature $i.\mathrm{stop}$ indicating whether, in the current analysis, the word $a_i$ has collected all of its dependents (T) or
it is still seeking some of them (F);

• for each $k$ with $1 \le k \le \deg(S)$, a feature $i.\mathrm{left}_k$ indicating that a left reduction is allowed (T) or forbidden (F)
between symbol $i$ and the symbol $k$ positions below $i$ in the stack;

• for each $k$ with $1 \le k \le \deg(S)$, a feature $i.\mathrm{right}_k$ indicating that a right reduction is allowed (T) or forbidden
(F) between symbol $i$ and the symbol $k$ positions below $i$ in the stack.

We now introduce some predicates that will be used later to define the new transition system $S'$. Let $i$ and $j$ be
stack symbols of $S'$. The predicate $\mathrm{bu}(i, j) = \neg i.\mathrm{stop} \wedge j.\mathrm{stop}$ indicates whether a bottom-up link from node $i$ to
node $j$ is admissible in the current configuration, i.e., whether node $i$ can accept a dependent and node $j$ has already
collected all of its dependents. Assume that $i$ and $j$ are located at stack positions $p$ and $q$, respectively, with $p > q$.
Then the predicates²

$\mathrm{left}(i, j; p, q) = j.\mathrm{left}_{(p-q)} \wedge \mathrm{bu}(j, i) \wedge \mathrm{la}_{p \leftarrow q} \in T$,
$\mathrm{right}(i, j; p, q) = j.\mathrm{right}_{(p-q)} \wedge \mathrm{bu}(i, j) \wedge \mathrm{ra}_{p \to q} \in T$

indicate that reductions $\mathrm{la}_{p \leftarrow q}$ and $\mathrm{ra}_{p \to q}$, respectively, are available in the current configuration, i.e., these reductions
can be performed by the parser. As we will see later, the notion of available reduction plays a crucial role in the
construction of $S'$.
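
The annotated stack symbols and the three predicates can be sketched directly. The class name `Sym`, the dictionary encoding of the features $\mathrm{left}_k$ and $\mathrm{right}_k$, and the triple encoding `(kind, p, q)` of reductions are assumptions made for this illustration only.

```python
from dataclasses import dataclass, field

@dataclass
class Sym:
    """A stack symbol of S': a node annotated with Boolean features."""
    node: int
    stop: bool
    left: dict = field(default_factory=dict)    # k -> value of feature left_k
    right: dict = field(default_factory=dict)   # k -> value of feature right_k

def bu(i, j):
    """Node i can still accept a dependent and node j has collected all of its own."""
    return (not i.stop) and j.stop

def left(i, j, p, q, T):
    """la_{p<-q} is available; i sits at stack position p, j at position q, p > q."""
    return j.left.get(p - q, False) and bu(j, i) and ("la", p, q) in T

def right(i, j, p, q, T):
    """ra_{p->q} is available; i sits at stack position p, j at position q, p > q."""
    return j.right.get(p - q, False) and bu(i, j) and ("ra", p, q) in T

# Arc-standard reductions, with node 1 below node 2 on the stack.
T = {("la", 2, 1), ("ra", 2, 1)}
one = Sym(1, stop=True, left={1: True}, right={1: True})
two = Sym(2, stop=False, left={1: True}, right={1: True})
print(left(one, two, 2, 1, T), right(one, two, 2, 1, T))
```

In this configuration node 2 is still seeking dependents and node 1 is finished, so only the left reduction (attaching 1 under 2) is available.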

3.2 Transitions 

The basic idea underlying the construction of S' is to perform reductions as early as they become available in a 
computation, according to the notion of available reduction that we have just introduced. This is implemented as 
follows. 

We define a priority relation among transitions in T such that, in choosing between several reductions that are 
compatible with some dependency tree, we give highest priority to the reduction with its dependent closest to the top 
of the stack. This reduction is necessarily unique, given that in a dependency tree each dependent has a unique head. 
The shift transitions are always assigned the lowest priority. 

Note that the priority relation can be seen as a partial order between reductions, but the set of reductions that are 
compatible with a given tree is totally ordered, due to the restriction that a node cannot have more than one head. 

In the new transition system $S'$ we simulate $S$ as follows. Given a configuration $c'_1$ of $S'$ representing a
configuration $c_1$ of $S$, we consider the set $T_{c_1}$ of all transitions from $S$ that are available at $c_1$. We nondeterministically choose a
transition $t \in T_{c_1}$ and simulate it on $c'_1$ under $S'$, moving into a new configuration $c'_2$. Most important, in $c'_2$ we set the
features of the stack symbols in such a way that all transitions in $T_{c_1}$ that had higher priority than $t$ are now blocked,
meaning that no computation spanning from $c'_2$ will ever be able to apply such transitions. We can now specify our
construction.

² Here we are overloading symbols left and right, with related meanings: it will always be clear from the context whether these symbols refer
to features or else to predicates.

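For a stack symbol $i$ of $S$, we write $i[\mathrm{T}]$ to denote the stack symbol of $S'$ such that $i.\varphi = \mathrm{T}$ for every feature $\varphi$.
For a feature $\varphi$ and a value $v$, we write $j = i[\varphi \leftarrow v]$ if $j.\varphi = v$ and $j.\varphi' = i.\varphi'$ for every other feature $\varphi'$. We
generalize this notation to a set of features $\mathcal{F}$, and write $j = i[\varphi \leftarrow v \mid \varphi \in \mathcal{F}]$ if $j.\varphi = v$ for each $\varphi \in \mathcal{F}$ and
$j.\varphi = i.\varphi$ for each $\varphi \notin \mathcal{F}$. Finally, as a shorthand, we write $i[\varphi \leftarrow v \mid \varphi \in \mathcal{F};\ \varphi' \leftarrow v' \mid \varphi' \in \mathcal{F}']$ in place of

$(i[\varphi \leftarrow v \mid \varphi \in \mathcal{F}])[\varphi' \leftarrow v' \mid \varphi' \in \mathcal{F}']$.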

The system $S'$ obtained by removing spurious ambiguity from $S$ has a set of transitions $T'$ including all and only
the transitions reported below, where $\delta$ is $\mathrm{depth}(S)$:

$\mathrm{sh}^{s} : (\sigma|i_\delta|i_{\delta-1}|\cdots|i_1, i|\beta, A) \vdash (\sigma|i'_\delta|i'_{\delta-1}|\cdots|i'_1|i', \beta, A)$

where we let $i' = i[\mathrm{T}]$, and for every $u$ with $1 \le u \le \delta$ we let

$i'_u = i_u[\mathrm{left}_k \leftarrow \mathrm{F} \mid \mathrm{left}(i_{u+k}, i_u; u+k, u);\ \mathrm{right}_k \leftarrow \mathrm{F} \mid \mathrm{right}(i_{u+k}, i_u; u+k, u)]$.

Transition $\mathrm{sh}^{s}$ simulates a shift of $S$. The superscript $s$ means that the new symbol $i'$ added to the stack has the
feature stop set to T, that is, we (nondeterministically) guess that $i'$ is now ready for bottom-up reduction. Since
the shift transition always has the lowest priority in $S$, $\mathrm{sh}^{s}$ blocks any reduction that was available in the antecedent
configuration, by setting the features of each $i'_u$, as indicated above.

We also add to $T'$ a transition $\mathrm{sh}^{\bar{s}}$ defined exactly as $\mathrm{sh}^{s}$ but with the only difference that we let $i' = (i[\mathrm{T}])[\mathrm{stop} \leftarrow \mathrm{F}]$,
that is, we guess that node $i'$ is still seeking dependents in the current analysis.

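For each $\mathrm{ra}_{p \to q}$ in $T$, we add to $T'$

$\mathrm{ra}^{s}_{p \to q} : (\sigma|i_p|\cdots|i_1, \beta, A) \vdash (\sigma|i'_p|\cdots|i'_{q+1}|i'_{q-1}|\cdots|i'_1, \beta, A \cup \{i_p \to i_q\})$

which can only be applied under the precondition $\mathrm{right}(i_p, i_q; p, q)$. Here we let $i'_p = i_p[\mathrm{stop} \leftarrow \mathrm{T}]$, and for every $u$
with $1 \le u \le \delta$ we let

$i'_u = i_u[\mathrm{left}_k \leftarrow \mathrm{F} \mid u + k < q \wedge \mathrm{left}(i_{u+k}, i_u; u+k, u);\ \mathrm{right}_k \leftarrow \mathrm{F} \mid u < q \wedge \mathrm{right}(i_{u+k}, i_u; u+k, u)]$.

As for the shift transition, we also add to $T'$ a transition $\mathrm{ra}^{\bar{s}}_{p \to q}$ defined exactly as $\mathrm{ra}^{s}_{p \to q}$ but with $i'_p = i_p[\mathrm{stop} \leftarrow \mathrm{F}]$.
Reductions $\mathrm{ra}^{s}_{p \to q}$ and $\mathrm{ra}^{\bar{s}}_{p \to q}$ block every reduction $t$ available in the antecedent configuration that has priority higher
than the reduction $p \to q$, that is, with a dependent at a position closer to the top of the stack than $q$.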

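Similarly to the above, for each $\mathrm{la}_{p \leftarrow q}$ in $T$ we add to $T'$

$\mathrm{la}^{s}_{p \leftarrow q} : (\sigma|i_p|i_{p-1}|\cdots|i_1, \beta, A) \vdash (\sigma|i'_{p-1}|\cdots|i'_1, \beta, A \cup \{i_q \to i_p\})$

which can only be applied under the precondition $\mathrm{left}(i_p, i_q; p, q)$. Here we let $i'_q = i_q[\mathrm{stop} \leftarrow \mathrm{T}]$, and for every $u$
with $1 \le u \le \delta$ we let

$i'_u = i_u[\mathrm{left}_k \leftarrow \mathrm{F} \mid u + k < p \wedge \mathrm{left}(i_{u+k}, i_u; u+k, u);\ \mathrm{right}_k \leftarrow \mathrm{F} \mid u < p \wedge \mathrm{right}(i_{u+k}, i_u; u+k, u)]$.

We also add to $T'$ a transition $\mathrm{la}^{\bar{s}}_{p \leftarrow q}$ defined exactly as $\mathrm{la}^{s}_{p \leftarrow q}$ but with $i'_q = i_q[\mathrm{stop} \leftarrow \mathrm{F}]$.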

The initialization function and final configuration set of $S'$ are like those of $S$, but we have to specify feature values
for the stack symbol corresponding to the dummy root node 0: all its features will be F in the initial configuration, and
in final configurations it must have the $\mathrm{left}_k$ and $\mathrm{right}_k$ features set to F but stop set to T.
Example 4. If we apply the transformation defined in this section to remove spurious ambiguity from the arc-standard
transition system of Example 1, we obtain a system $S'$ where the only valid computation for the tree with arcs
$\{0 \to 2, 2 \to 1, 2 \to 3\}$ is

$(I(w);\ \mathrm{sh}^{s}, \mathrm{sh}^{\bar{s}}, \mathrm{la}^{\bar{s}}_{2 \leftarrow 1}, \mathrm{sh}^{s}, \mathrm{ra}^{s}_{2 \to 1}, \mathrm{ra}^{s}_{2 \to 1})$,

which builds the arcs in the same order as the computation (i) of Example 1.

It is easy to check that an alternate computation building the arcs in the order of the computation (ii) does not
exist in $S'$. Such a computation would have to start with the transitions $\mathrm{sh}^{s}, \mathrm{sh}^{\bar{s}}, \mathrm{sh}^{s}, \mathrm{ra}^{\bar{s}}_{2 \to 1}$ (the need to use the $s$ or $\bar{s}$
variant of each transition is uniquely determined by whether nodes have pending dependents or not).

However, after applying these transitions the parser will be in a configuration $([0, 1, 2], [], \{2 \to 3\})$ with:

0.stop = F, 0.left$_1$ = F, 0.right$_1$ = F,

1.stop = T, 1.left$_1$ = T, 1.right$_1$ = F,

2.stop = F, 2.left$_1$ = F, 2.right$_1$ = T.

At such configuration, the feature value 2.left$_1$ = F blocks the left reduction creating the arc $2 \to 1$. This is so because
the $\mathrm{sh}^{s}$ transition that moved the node 3 to the stack set this value to F, blocking this left reduction since it could have
been executed at that point with higher priority than $\mathrm{sh}^{s}$.
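
The claim of Example 4 can be checked by brute force on short inputs. The sketch below specializes the construction to the arc-standard system (a single feature $\mathrm{left}_1$ and $\mathrm{right}_1$ per symbol) and enumerates all complete computations of $S$ and $S'$. The function names and the tuple encoding `(node, stop, left1, right1)` are assumptions for illustration; this is not a general implementation of the transformation.

```python
def complete_S(n):
    """Arc sets of all complete computations of plain arc-standard on n words."""
    out, todo = [], [((0,), 1, frozenset())]    # (stack, next buffer node, arcs)
    while todo:
        stack, b, arcs = todo.pop()
        if stack == (0,) and b > n:
            out.append(arcs)
            continue
        if b <= n:                                               # shift
            todo.append((stack + (b,), b + 1, arcs))
        if len(stack) >= 2:
            i, j = stack[-2], stack[-1]
            todo.append((stack[:-2] + (j,), b, arcs | {(j, i)}))  # la
            todo.append((stack[:-1], b, arcs | {(i, j)}))         # ra
    return out

def complete_S_prime(n):
    """Same enumeration for S'; symbols are (node, stop, left1, right1)."""
    root = (0, False, False, False)             # all root features start as F
    out, todo = [], [((root,), 1, frozenset())]
    while todo:
        stack, b, arcs = todo.pop()
        if b > n and len(stack) == 1 and stack[0][0] == 0 and stack[0][1]:
            out.append(arcs)                    # terminal: root has stop = T
            continue
        la_ok = ra_ok = False
        if len(stack) >= 2:
            (ni, si, li, ri), (nj, sj, lj, rj) = stack[-2], stack[-1]
            la_ok = lj and (not sj) and si      # left(i_2, i_1; 2, 1)
            ra_ok = rj and (not si) and sj      # right(i_2, i_1; 2, 1)
        if b <= n:
            rest = stack
            if len(stack) >= 2:                 # a shift blocks available reductions
                rest = stack[:-1] + ((nj, sj, lj and not la_ok, rj and not ra_ok),)
            for stop in (True, False):          # guess the stop feature of the new node
                todo.append((rest + ((b, stop, True, True),), b + 1, arcs))
        for stop in (True, False):              # guess the new stop value of the head
            if la_ok:
                todo.append((stack[:-2] + ((nj, stop, lj, rj),), b, arcs | {(nj, ni)}))
            if ra_ok:
                todo.append((stack[:-2] + ((ni, stop, li, ri),), b, arcs | {(ni, nj)}))
    return out

S, S_prime = complete_S(3), complete_S_prime(3)
target = frozenset({(0, 2), (2, 1), (2, 3)})
print(S.count(target), S_prime.count(target))
```

For three words, the tree of Example 1 is produced by two computations of $S$ but by a single computation of the transformed system, and both systems generate the same set of trees.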

4 Formal Properties and Proofs 

We now proceed to prove that the described transformation for the removal of spurious ambiguity is correct (i.e., prove
Theorem 3). To do so, we first show that transition systems $S$ and $S'$ defined as in §3 are equivalent, i.e., they assign
the same set of trees to any input string. Afterward, we show that $S'$ has no spurious ambiguity, i.e., different complete
computations of $S'$ will always produce different dependency trees.

4.1 Equivalence of Unambiguous System to Original System 

Let $S$ and $S'$ be defined as in Section 3, with associated transition sets $T$ and $T'$, respectively. To show that $S$ and $S'$
are equivalent, we need to prove that for every input string $w$

(i) for each complete computation $\gamma'$ of $S'$ on $w$ there is a complete computation $\gamma$ of $S$ such that $D(\gamma) = D(\gamma')$;
and

(ii) for each complete computation $\gamma$ of $S$ on $w$ there is a complete computation $\gamma'$ of $S'$ such that $D(\gamma) = D(\gamma')$.

The proof of (i) is rather straightforward. We show a mapping from the complete computations of $S'$ to the
complete computations of $S$ that preserves the associated trees. We define a homomorphism $\tau$ from $T'$ to $T$ by letting

$\tau(\mathrm{la}^{s}_{p \leftarrow q}) = \tau(\mathrm{la}^{\bar{s}}_{p \leftarrow q}) = \mathrm{la}_{p \leftarrow q}$,

$\tau(\mathrm{ra}^{s}_{p \to q}) = \tau(\mathrm{ra}^{\bar{s}}_{p \to q}) = \mathrm{ra}_{p \to q}$,

$\tau(\mathrm{sh}^{s}) = \tau(\mathrm{sh}^{\bar{s}}) = \mathrm{sh}$,

and extend it to (complete) computations (recall that we represent a computation by its initial configuration and its
sequence of transitions) by letting $\tau((c_0; t_1, \ldots, t_m)) = (c_0; \tau(t_1), \ldots, \tau(t_m))$.

It is not difficult to see that if $\gamma$ is complete, then $\tau(\gamma)$ is also complete. Furthermore, this mapping preserves trees,
i.e., for any computation $\gamma$ of $S'$ we have $D(\gamma) = D(\tau(\gamma))$, because transitions $t \in T'$ and $\tau(t) \in T$ create the same
arc, if any. This concludes the proof of (i).

To prove statement (ii) above, let $\gamma = c_0, \ldots, c_m = (c_0; t_1, \ldots, t_m)$ be a complete computation of $S$ for an input
string $w$, and let $A_\gamma$ be the set of arcs in $D(\gamma)$. We show that we can always find a computation $\gamma'$ of $S'$ such that
$D(\gamma') = D(\gamma)$. To do this, we introduce below the notion of canonical computations of $S$. Then we proceed in two
steps: first we transform $\gamma$ into a canonical computation $\gamma_c$ of $S$ equivalent to $\gamma$, and then we transform $\gamma_c$ into an
equivalent computation $\gamma'$ of $S'$.

Consider a configuration $c_k$, $0 \le k < m$, appearing in $\gamma$. Let $\mathcal{R}_{k,\gamma}$ be the set of reductions of $S$ that can be applied
to $c_k$, and that are compatible with $D(\gamma)$, i.e., these reductions construct an arc $(h \to d) \in A_\gamma$. Here $a_h$ is the head
word and $a_d$ is the dependent word, both from $w$.

Assume that $\mathcal{R}_{k,\gamma} \ne \emptyset$, and let $t_p$ be the reduction in $\mathcal{R}_{k,\gamma}$ with the highest priority. This means that $t_p$ is the
reduction in $\mathcal{R}_{k,\gamma}$ with dependent node $d$ placed at the position closest to the top in the stack associated with $c_k$ or,
equivalently, the reduction with the largest value of index $d$ in $w$. Note that there cannot be more than one such
reduction, due to the single-head constraint in $D(\gamma)$.

We say that $c_k$ is a troublesome configuration in $\gamma$ if $t_{k+1} \ne t_p$. This means that $t_{k+1}$ is either a shift transition,
or else a reduction in $\mathcal{R}_{k,\gamma}$ that, when applied to $c_k$, creates a dependency link $h' \to d'$ with $d' < d$, i.e., a reduction
with lower priority than $t_p$, since node $d'$ will be placed at a deeper position than node $d$ in the stack associated with
$c_k$.

We say that a computation of $S$ is in canonical form if it does not contain any troublesome configuration. This
means that, at each configuration of a canonical computation, the reduction in $\mathcal{R}_{k,\gamma}$ with the highest priority is
taken, in case set $\mathcal{R}_{k,\gamma}$ is not empty. We now show that for every computation $\gamma$ of $S$ there exists an equivalent
canonical computation $\gamma_c$ of $S$. We show how to eliminate the leftmost troublesome configuration in $\gamma$; iteration of
this process will always produce a computation where no configurations are troublesome.

Let $c_k$ be the leftmost troublesome configuration in $\gamma$. We show that we can build a computation $\gamma_k$ of $S$ which is
equivalent to $\gamma$, and such that its first $k$ configurations are not troublesome. The transition sequence $t_{k+1}, \ldots, t_m$ can
be written in the form

$t_{k+1}, t_{k+2}, \ldots, t_{j-1}, t'_p, t_{j+1}, \ldots, t_m$

where $t'_p$ is a reduction creating the same link $h \to d$ that should have been created by the reduction $t_p \in \mathcal{R}_{k,\gamma}$ with
the highest priority. Note that reduction $t'_p$ must take place at some $c_j$ in $\gamma$ with $j > k + 1$, because $h \to d$ is in $D(\gamma)$,
and this link cannot be present in the arc set associated with $c_k$ (if it were, the reduction $t_p$ could not be available at $c_k$
because $d$ would not be in the stack at that configuration).

The sequence $t_{k+1}, \ldots, t_m$ in $\gamma$ can then be replaced (generating the same tree) with

$t_p, \tau_d(t_{k+1}), \ldots, \tau_d(t_{j-1}), t_{j+1}, \ldots, t_m$

where $\tau_d(t)$ represents the transition that creates the same arc in a stack where the node $d$ has been removed as $t$
would create in a stack where the node $d$ is present. Formally, for a transition applied at a configuration $c$ with stack
$\sigma|i_p|\cdots|i_q|\cdots|i_1$, we define $\tau_d(\mathrm{sh}) = \mathrm{sh}$ and

\[
\tau_d(\mathrm{ra}_{p \to q}) =
\begin{cases}
\mathrm{ra}_{p \to q} & \text{if } i_p > d \text{ and } i_q > d, \\
\mathrm{ra}_{p-1 \to q} & \text{if } i_p < d \text{ and } i_q > d, \\
\mathrm{ra}_{p-1 \to q-1} & \text{if } i_p < d \text{ and } i_q < d,
\end{cases}
\qquad
\tau_d(\mathrm{la}_{p \leftarrow q}) =
\begin{cases}
\mathrm{la}_{p \leftarrow q} & \text{if } i_p > d \text{ and } i_q > d, \\
\mathrm{la}_{p-1 \leftarrow q} & \text{if } i_p < d \text{ and } i_q > d, \\
\mathrm{la}_{p-1 \leftarrow q-1} & \text{if } i_p < d \text{ and } i_q < d.
\end{cases}
\]

Note that, since $S$ is monotonic, the existence of a transition $t$ implies the existence of $\tau_d(t)$.

The computations $\gamma_k$ and $\gamma$ produce the same tree. Also, in $\gamma_k$ the first $k$ configurations are not troublesome, since
applying the reduction $t_p$ at $c_k$ makes $c_k$ not troublesome, and by construction the configurations to the left of $c_k$ in
$\gamma_k$ are not troublesome.

By iteratively applying the above process, we eventually obtain a computation $\gamma_c$ of $S$ such that $D(\gamma_c) = D(\gamma)$.
It then remains to show that we can obtain a computation $\gamma'$ of $S'$ with the same associated dependency tree as $\gamma_c$.

Let $\gamma_c = (c_0; t_1, \ldots, t_m)$ and assume that for each $j$, $1 \le j \le m$, transition $t_j$ in $\gamma_c$ applies to configuration
$c_{j-1} = (\sigma|i_p|\cdots|i_q|\cdots|i_1, i_0|\beta, A)$. The computation $\gamma'$ is obtained as $\gamma' = (c_0; t'_1, \ldots, t'_m)$, where for each $j$, $t'_j$ is
specified as follows.
• If $t_j = \mathrm{ra}_{p \to q}$, then $t'_j$ is $\mathrm{ra}^{\bar{s}}_{p \to q}$ if $A_\gamma \setminus (A \cup \{(i_p, i_q)\})$ contains a dependency link of the form $(i_p, u)$ for some
$u$, and $t'_j$ is $\mathrm{ra}^{s}_{p \to q}$ otherwise.

• If $t_j = \mathrm{la}_{p \leftarrow q}$, then $t'_j$ is $\mathrm{la}^{\bar{s}}_{p \leftarrow q}$ if $A_\gamma \setminus (A \cup \{(i_q, i_p)\})$ contains a dependency link of the form $(i_q, u)$ for some
$u$, and $t'_j$ is $\mathrm{la}^{s}_{p \leftarrow q}$ otherwise.

• If $t_j = \mathrm{sh}$, then $t'_j$ is $\mathrm{sh}^{\bar{s}}$ if $A_\gamma \setminus A$ contains a dependency link of the form $(i_0, u)$ for some $u$, and $t'_j$ is $\mathrm{sh}^{s}$
otherwise.

It is not difficult to see that $\gamma'$ is a valid computation of $S'$ for $w$. This follows from the fact that the transitions $t'_j$
above satisfy the $\mathrm{bu}(i, j)$ predicates in $S'$, and the fact that in $\gamma_c$ reductions are applied in accordance with the priority
relation. We also observe that if $\gamma_c$ is complete then $\gamma'$ is complete as well. Finally, the fact that $D(\gamma') = D(\gamma_c)$
follows immediately from the above mapping from transitions $t_j$ to transitions $t'_j$. This concludes the proof of (ii) and
thus the proof of the equivalence of $S$ and $S'$.
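
For the arc-standard system, the two steps of this proof (reduce as soon as a compatible reduction is possible, then pick the $s$ or $\bar{s}$ variant from the pending dependents of the head) can be combined into a small oracle. This is a sketch under assumptions: the function name `canonical_computation` and the ASCII labels such as `"sh^s"` and `"la^sbar"` are made up for illustration, and the code handles only the arc-standard reductions.

```python
def canonical_computation(n, tree):
    """Canonical transition sequence of the transformed arc-standard system.

    tree is a set of (head, dependent) arcs over nodes 0..n.
    """
    tree = set(tree)
    stack, b, arcs, out = [0], 1, set(), []

    def pending(h):
        # True if h still has dependents in the tree that are not built yet.
        return any(head == h for head, _ in tree - arcs)

    def variant(h):
        return "sbar" if pending(h) else "s"

    while not (stack == [0] and b > n):
        if len(stack) >= 2:
            i, j = stack[-2], stack[-1]
            if (j, i) in tree and not pending(i):   # left reduction, head j
                arcs.add((j, i))
                del stack[-2]
                out.append("la^" + variant(j))
                continue
            if (i, j) in tree and not pending(j):   # right reduction, head i
                arcs.add((i, j))
                stack.pop()
                out.append("ra^" + variant(i))
                continue
        if b > n:
            raise ValueError("tree cannot be derived by the arc-standard system")
        out.append("sh^" + variant(b))              # shift only if no reduction applies
        stack.append(b)
        b += 1
    return out

print(canonical_computation(3, {(0, 2), (2, 1), (2, 3)}))
```

On the tree of Example 1 the oracle follows the order of computation (i): the left dependent of node 2 is attached before node 3 is shifted.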



4.2 Non-ambiguity of the Transition System 

To prove that our transformed system $S'$ has no spurious ambiguity, we need to show that different complete
computations of $S'$ for $w$ always produce different trees, i.e., if $\gamma_1 \neq \gamma_2$ are complete
computations of $S'$ for input string $w$, then $D(\gamma_1) \neq D(\gamma_2)$.

To do so we write $\gamma_1$ as $\alpha c_1 \beta_1$ and $\gamma_2$ as $\alpha c_2 \beta_2$, with $\alpha$ the common
prefix of the two computations, and $c_1$, $c_2$ configurations such that $c_1 \neq c_2$. Note that $\alpha$ cannot be
empty, since both computations must at least have the initial configuration $I(w)$ in common. We call $c_0$ the last
configuration in $\alpha$, and $t_1$, $t_2$ the transitions that produce $c_1$, $c_2$ (respectively) from $c_0$. We
distinguish four cases below.

Case 1: $t_1$ and $t_2$ are transitions that differ only in the stop feature of some new node $u$ in the configuration
they produce. As an example, we have $t_1 = la^{s}_{p \leftarrow q}$ and $t_2 = la^{\bar{s}}_{p \leftarrow q}$, which
differ in the stop feature of node $u = i_q$. Without loss of generality, we assume $u.stop = T$ in $c_1$, and
$u.stop = F$ in $c_2$. Let $c_0 = (\sigma, \beta, A)$. Then $D(\gamma_2)$ must contain at least one arc originating
from $u$ that is not present in $A$, while $D(\gamma_1)$ cannot contain any arc originating from $u$ that is not
already in $A$, because $u.stop = T$ prevents the addition of dependents of $u$ after $t_1$ is executed. Therefore,
$D(\gamma_1) \neq D(\gamma_2)$.

Case 2: $t_1$ and $t_2$ are reduce transitions with different head nodes but the same dependent node $u$. In this
case, $D(\gamma_1) \neq D(\gamma_2)$ follows from the single-head constraint, since the node $u$ is assigned different
heads in $\gamma_1$ and $\gamma_2$, respectively.

Case 3: $t_1$ and $t_2$ are reduce transitions involving different dependent nodes. Suppose that $t_1$ creates the arc
$h_1 \to d_1$ and $t_2$ creates the arc $h_2 \to d_2$. Without loss of generality, we assume that $d_1 > d_2$, i.e.,
$t_1$ has higher priority than $t_2$. Then $D(\gamma_1)$ contains the arc $h_1 \to d_1$, but $D(\gamma_2)$ cannot
contain this arc, since the system's features block its construction after the application of the transition $t_2$ at
configuration $c_0$.

Case 4: $t_1$ is a reduce transition and $t_2$ is a shift transition. The same reasoning as in Case 3 applies: the arc
$h_1 \to d_1$ created by $t_1$ cannot appear in $D(\gamma_2)$, because the system's features block its construction
after the shift transition $t_2$ is applied. This concludes the proof that $S'$ does not have spurious ambiguity.
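To make the property just proved concrete, it may help to see it fail for an unmodified system. The sketch below is an
illustration only: it enumerates all complete computations of a plain arc-standard system (shift, plus left-arc and
right-arc over the two topmost stack elements) and groups them by the tree they build. The representation of arcs as
(head, dependent) pairs, the absence of an artificial root node, and the names `complete_computations` and
`group_by_tree` are assumptions made for this example.

```python
from collections import defaultdict


def complete_computations(n):
    """Yield (transition sequence, arc set) for words 1..n under arc-standard."""

    def step(stack, buffer, arcs, transitions):
        # A computation is complete when the buffer is empty and one node remains.
        if not buffer and len(stack) == 1:
            yield tuple(transitions), arcs
            return
        if buffer:  # sh: move the first buffer element onto the stack
            yield from step(stack + [buffer[0]], buffer[1:], arcs, transitions + ["sh"])
        if len(stack) >= 2:
            below, top = stack[-2], stack[-1]
            # la: the topmost node becomes the head of the node below it
            yield from step(stack[:-2] + [top], buffer,
                            arcs | {(top, below)}, transitions + ["la"])
            # ra: the node below becomes the head of the topmost node
            yield from step(stack[:-2] + [below], buffer,
                            arcs | {(below, top)}, transitions + ["ra"])

    yield from step([], list(range(1, n + 1)), frozenset(), [])


def group_by_tree(n):
    """Map each derived arc set to the list of computations that produce it."""
    by_tree = defaultdict(list)
    for transitions, arcs in complete_computations(n):
        by_tree[arcs].append(transitions)
    return by_tree


by_tree = group_by_tree(3)
ambiguous = {tree: seqs for tree, seqs in by_tree.items() if len(seqs) > 1}
```

For a three-word input this enumeration yields eight complete computations but only seven distinct trees: the tree in
which the middle word governs both of its neighbours is reached both by sh, sh, la, sh, ra and by sh, sh, sh, ra, la.
A system without spurious ambiguity would leave `ambiguous` empty.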



4.3 Complexity 



Let $S$ be a bottom-up monotonic transition system, and let $\deg(S) = \delta$. The construction in §3 adds
$2\delta + 1$ binary features to each stack symbol of $S$. This results in $2^{2\delta+1}$ new symbols in $S'$ for each
stack symbol of $S$. While for projective dependency parsing we have $\delta = 1$, a degree larger than one is needed
in non-projective parsing. However, it has been observed by Attardi [2006] that most of the non-projective trees in the
CoNLL data can be parsed with $\delta = 2$ or 3. This means that, in practical cases, the blow-up of stack symbols by
our construction can be considered a small constant.
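As a quick check of how small this constant is, the following Python sketch evaluates the $2^{2\delta+1}$ bound for the
degrees mentioned above. The function name `symbols_per_stack_symbol` is hypothetical; the sketch covers only the
general bound, not any system-specific optimization of the features.

```python
def symbols_per_stack_symbol(delta):
    """Number of symbols in S' per stack symbol of S when 2*delta + 1 binary features are added."""
    return 2 ** (2 * delta + 1)


# Degrees 1 to 3 give 8, 32 and 128 symbols per original stack symbol.
blow_up = {delta: symbols_per_stack_symbol(delta) for delta in (1, 2, 3)}
```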

To discuss a concrete application, consider the non-projective system $S$ of Attardi [2006], also shown in Example 2,
restricted to $\delta = 2$, which is still heavily affected by spurious ambiguity. We have applied the construction in
§3 to $S$ with some ad-hoc optimization of the features for that system, resulting in a new system $S'$ with a blow-up
of stack symbols of $2^{\delta+1} = 8$. This means that we can apply to $S'$ the inside/outside algorithm presented in
Cohen et al. [2011], working in time $O(|w|^7)$ for an input string $w$, with an extra hidden constant of 8.

Language        Size    Attardi    This paper
Arabic         1,460         27             2
Bulgarian     12,823         47            36
Czech         72,703      1,334           602
Danish         5,190        179           159
Dutch         13,349      1,448         1,018
German        39,216      2,140         1,538
Japanese      17,044        121            45
Portuguese     9,071        295           203
Slovene        1,534         48            27
Spanish        3,306         11            10
Swedish       11,042        197           105
Turkish        4,997        208           102

Table 1: Coverage of Attardi's oracle versus the coverage of our oracle for various treebanks from the CoNLL 2006
data sets [Buchholz and Marsi, 2006]. "Size" denotes the number of sentences in the treebank (we used the training
portion only), "Attardi" denotes the number of sentences that Attardi's oracle could not parse, and "This paper"
denotes the number of sentences that our oracle could not parse.



5 Experiments 

As mentioned earlier, transition-based dependency parsing uses an oracle to convert training data consisting of pairs
of sentences and dependency trees into pairs of sentences and shift-reduce sequences, in order to sidestep the issue
of spurious ambiguity. The new training data is then used to train multi-class classifiers. In several cases, oracles
are based on heuristics and are incomplete. The oracle provided in the DeSR dependency parsing package,^3 which is
based on the parser of Attardi [2006], is an example of such an incomplete heuristic.

We compared the coverage of Attardi's oracle, restricted to transitions of degree at most 2, to the oracle of an
equivalent transition system without spurious ambiguity.^4 Our findings are given in Table 1. As theoretically
guaranteed, there were no cases where Attardi's parser recognized a tree using transitions of degree 2, and our oracle
did not recognize it. The reverse, however, holds quite often.
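The counts in Table 1 can be turned into failure rates with a few lines of Python. The sketch below copies the figures
from Table 1 and derives, per language, the fraction of training sentences each oracle could not parse. The variable
names are hypothetical, and the comparison of counts is only a consistency check that follows from the guarantee stated
above, not a substitute for it.

```python
# language: (treebank size, sentences Attardi's oracle failed on, sentences our oracle failed on)
TABLE_1 = {
    "Arabic": (1460, 27, 2),
    "Bulgarian": (12823, 47, 36),
    "Czech": (72703, 1334, 602),
    "Danish": (5190, 179, 159),
    "Dutch": (13349, 1448, 1018),
    "German": (39216, 2140, 1538),
    "Japanese": (17044, 121, 45),
    "Portuguese": (9071, 295, 203),
    "Slovene": (1534, 48, 27),
    "Spanish": (3306, 11, 10),
    "Swedish": (11042, 197, 105),
    "Turkish": (4997, 208, 102),
}

# Fraction of sentences left unparsed by each oracle, per language.
failure_rates = {
    language: (attardi / size, ours / size)
    for language, (size, attardi, ours) in TABLE_1.items()
}

# Our oracle should never fail on more sentences than Attardi's oracle.
never_worse = all(ours <= attardi for _, attardi, ours in TABLE_1.values())

total_size = sum(size for size, _, _ in TABLE_1.values())
total_attardi = sum(attardi for _, attardi, _ in TABLE_1.values())
total_ours = sum(ours for _, _, ours in TABLE_1.values())

for language, (rate_attardi, rate_ours) in failure_rates.items():
    print(f"{language:<12}{rate_attardi:8.2%}{rate_ours:8.2%}")
```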



6 Discussion 



We note that monotonic bottom-up shift-reduce transition systems can be made probabilistic and generative, in a
manner similar to Cohen et al. [2011]. The issue with spurious ambiguity is especially crucial with generative models
in the unsupervised setting, when using algorithms such as the expectation-maximization (EM) algorithm. Cohen et al.
[2011] describe an EM algorithm for the system from Attardi [2006], which can be extended to any monotonic bottom-up
transition system. The EM algorithm they describe can be further extended to monotonic bottom-up transition
systems after removal of spurious ambiguity (as we describe in this paper), making these systems readily available for
transition-based unsupervised learning for dependency parsing.



3 http://desr.sourceforge.net/

4 Note that the algorithm implemented in the latest version of DeSR, which we used for these experiments, differs from
the description provided in Attardi [2006] and Example 2 in that $la_{3 \leftarrow 1}$ and $ra_{3 \to 1}$ transitions
push a node from the stack back to the buffer after reducing. This does not affect our method to remove spurious
ambiguity, which is correct both for the version described in Attardi [2006] and for the latest implementation of
Attardi's parser.
7 Conclusion



We provided a principled treatment of the issue of spurious ambiguity in transition-based dependency parsing. We
defined a large class of transition systems, which we call monotonic bottom-up shift-reduce transition systems, that
covers existing systems such as the arc-standard parser of Nivre [2008] and the non-projective parser of Attardi
[2006], as well as systems in which reductions affect elements at positions in the stack deeper than the topmost
element [Goldberg and Elhadad, 2010]. We then showed how to eliminate spurious ambiguity from these systems. Our
technique has applications for unsupervised and supervised dependency parsing. The transition model that we present
can be used as a substitute for models such as the dependency model with valence that have long been used for
dependency grammar induction [Klein and Manning, 2004; Cohen and Smith, 2010; Spitkovsky et al., 2010].

In this paper we have identified sufficient conditions under which spurious ambiguity can be removed
from bottom-up dependency transition systems, which we hope are as "tight" as possible. However, our technique
does not work for all dependency transition systems, and it remains an open problem whether removal of
spurious ambiguity can be carried out in the general case. There may well be dependency parsing strategies for
which removal of spurious ambiguity is not only difficult, but simply impossible. A similar scenario is observed, for
instance, for structural ambiguity in context-free grammars, where some context-free languages can only be generated
using ambiguous context-free grammars; see for instance Hopcroft et al. [2006].